About 1 in 8 U.S. women (about 12%) will develop invasive breast cancer over the course of their lifetime. There are two types of tumors: benign and malignant. A malignant tumor is cancerous and is dangerous if it is not treated in its early stages. Cancer diagnosis and treatment are also very expensive in the United States. Hence, in search of a cheaper diagnostic technique, we propose several models, namely logistic regression, linear classification on PCA (Principal Component Analysis) components, and Random Forest, which can help diagnose a breast tumor as benign or malignant.
Breast cancer is a type of cancer that develops from breast tissue and is often accompanied by a lump in the breast, a change in breast shape, red and patchy skin, or fluid emanating from the nipple. The causes of breast cancer are not fully understood to date. Some genetic factors and some environmental factors are associated with its development. Breast cancer is preliminarily detected by a mammogram and confirmed by a biopsy.
There is no single measurement that can be used to determine whether a given sample is benign or malignant. In 2019, an estimated 268,600 new cases of invasive breast cancer are expected to be diagnosed in women in the U.S., along with 62,930 new cases of non-invasive (in situ) breast cancer. About 2,670 new cases of invasive breast cancer are expected to be diagnosed in men in 2019. A man's lifetime risk of breast cancer is about 1 in 883. Breast cancer incidence rates in the U.S. began decreasing in the year 2000, after increasing for the previous two decades. They dropped by 7% from 2002 to 2003 alone. One theory is that this decrease was partially due to the reduced use of hormone replacement therapy (HRT) by women after the results of a large study called the Women’s Health Initiative were published in 2002. These results suggested a connection between HRT and increased breast cancer risk.
A tumor can be one of two types: benign or malignant.
- Benign tumors are non-malignant (non-cancerous). A benign tumor is usually localized and does not spread to other parts of the body. Most benign tumors respond well to treatment. However, if left untreated, some benign tumors can grow large and lead to serious disease because of their size. Benign tumors can also mimic malignant tumors, and for this reason they are sometimes treated.
- Malignant tumors are cancerous growths. They are much more dangerous than benign tumors. They usually grow very rapidly, are often resistant to treatment, may spread to other parts of the body, and sometimes recur after removal.
We have 9 attributes that can help us detect whether a tumor is benign or malignant. They are described below.
Clump Thickness: This is used to assess whether cells are mono-layered or multi-layered. Benign cells tend to be grouped in mono-layers, while cancerous cells are often grouped in multi-layers.
Uniformity of Cell Size: It is used to evaluate the consistency in the size of cells in the sample. Cancer cells tend to vary in size, which is why this parameter is very valuable in determining whether the cells are cancerous.
Uniformity of Cell Shape: It is used to assess how uniform the cell shapes are and to identify marginal variances, because cancer cells tend to vary in shape.
Marginal Adhesion: Normal cells tend to stick together. Cancer cells tend to lose this ability, so loss of adhesion is a sign of malignancy.
Single Epithelial Cell Size: It is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.
Bare Nuclei: This is a term used for nuclei that are not surrounded by cytoplasm. They are typically seen in benign tumors.
Bland Chromatin: Describes a uniform texture of the nucleus seen in benign cells. In cancer cells, the chromatin tends to be coarser.
Normal Nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small, if visible at all. In cancer cells the nucleoli become much more prominent, and sometimes there are more of them.
Mitoses: It is an estimate of the number of mitoses that have taken place. The larger the value, the greater the chance of malignancy.
The above heatmap shows the correlation between variables. The darker the blue, the higher the positive correlation; the darker the red, the higher the negative correlation. The size of each dot is proportional to the absolute value of the correlation. As we can see, all the variables are positively correlated with each other and with the target variable.
To see how each variable is correlated with the target variable, we have also shown the correlation values in sorted order.
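The sorted correlation values can be reproduced with a few lines of code. The following is a minimal sketch, assuming Pearson correlation was the measure used (the text does not say which). The function name `pearson` and the toy values are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values for illustration only, NOT the real data.
features = {
    "Clump_thickness": [1, 2, 4, 5, 8, 10],
    "Mitoses": [1, 1, 1, 1, 2, 1],
}
is_malignant = [0, 0, 0, 1, 1, 1]

# Rank features by their correlation with the target, strongest first.
ranked = sorted(
    ((name, pearson(values, is_malignant)) for name, values in features.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:.3f}")
```

A feature that barely varies, such as the toy `Mitoses` column here, tends to rank lower because it carries little information about the target.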
To get a sense of the range of each variable, we plotted boxplots. For example, most clump thickness values fall between 2 and 6, with a median close to 4. For uniformity of cell size and uniformity of cell shape, the values start at 1, which is also the median, meaning that at least 50% of the data has a value of 1 for these variables. For mitoses, the data is clearly concentrated at 1, with little variation apart from a few outliers. How mitoses affects breast cancer is therefore difficult to analyze from this dataset.
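The numbers a boxplot summarizes can also be computed directly. Below is a small sketch under stated assumptions: the `mitoses` values are hypothetical stand-ins chosen to mimic a variable concentrated at 1, and the 1.5 × IQR outlier rule is the common boxplot convention rather than something the text specifies.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def outliers(values):
    """Values outside the usual 1.5 * IQR whiskers of a boxplot."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical values, NOT the real data: almost everything sits at 1.
mitoses = [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 7]

print(five_number_summary(mitoses))
print(outliers(mitoses))
```

When the first quartile, median, and third quartile all coincide, the box collapses to a line and every other value shows up as an outlier, which matches what we see for mitoses.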
We used Logistic Regression with the backward elimination technique to reduce the number of variables. We checked how each variable affected the residuals and eliminated variables accordingly, one at a time. At each step we removed the variable with the highest p-value, and we repeated this until every remaining variable had a p-value below 0.01, the significance level we had set. For example, with no variables eliminated, the model summary showed that Uniformity_of_cell_size had the highest p-value, 0.773024, so we eliminated Uniformity_of_cell_size. Similarly, we removed 4 more variables one by one. Finally, we were down to 4 predictors from 9. Clump thickness, Uniformity of cell shape, Bland Chromatin and Marginal Adhesion are the final 4 variables for our glm model. Let us analyze their individual impact on the target variable.
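The elimination loop described above can be sketched as follows. This is only an illustration of the control flow, not the code we ran: `fit_pvalues` is a hypothetical callable standing in for refitting the logistic regression, and the p-values in `fake_fit` are made up, except for 0.773024, which is the value reported above.

```python
def backward_eliminate(features, fit_pvalues, alpha=0.01):
    """Drop the feature with the highest p-value until all are below alpha.

    fit_pvalues is a hypothetical callable: given a list of feature names,
    it refits the model and returns {feature_name: p_value}.
    """
    selected = list(features)
    while selected:
        pvalues = fit_pvalues(selected)
        worst = max(selected, key=lambda name: pvalues[name])
        if pvalues[worst] < alpha:
            break  # every remaining feature is significant
        selected.remove(worst)
    return selected


def fake_fit(selected):
    """Stand-in for a real model fit; p-values here are illustrative only."""
    made_up = {
        "Clump_thickness": 0.0001,
        "Uniformity_of_cell_size": 0.773024,  # value reported in the text
        "Mitoses": 0.2,
        "Bland_Chromatin": 0.004,
    }
    return {name: made_up[name] for name in selected}


kept = backward_eliminate(
    ["Clump_thickness", "Uniformity_of_cell_size", "Mitoses", "Bland_Chromatin"],
    fake_fit,
)
print(kept)
```

In a real fit the p-values change every time a variable is removed, which is why the model has to be refitted inside the loop rather than ranked once.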
Clump thickness: Values for benign cells tend to be at the lower end, and values for malignant cells tend to be higher in general.
Bland chromatin, Uniformity of cell shape, Marginal adhesion: For benign cells, values are densely concentrated at the lower end of the range and do not vary much, whereas for malignant cells, values have higher variance and are spread roughly across the entire range.
The graph shows how the probability of a cell being malignant varies as the value of each variable increases. For all 4 variables, the probability of a cell being malignant increases as the value increases. This corroborates the insight from the density plot above, where we saw that values for benign cells tend to be at the lower end.
To see how the class distribution varies with 2 variables at a time, we plotted these graphs. Once the value of either variable rises above roughly 6.5 to 7.5, the cells are almost always malignant. Values lower than 3 indicate benign cells, but for values between 3 and 6.5 the class is uncertain.
To increase the complexity of the model, and eventually the accuracy of the prediction, we decided to include combinations of two variables as well. With 4 variables there are 6 ways to choose 2 of them. Out of these 6 we picked 3 combinations based on their interaction plots, and trial and error later confirmed that these performed best.
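As a quick check of the count, the candidate pairs can be enumerated. This is a minimal sketch in which the predictor names follow the model formula and the `first:second` notation mirrors how interaction terms are written there.

```python
from itertools import combinations

predictors = [
    "Clump_thickness",
    "Marginal_adhesion",
    "Uniformity_of_cell_shape",
    "Bland_Chromatin",
]

# Every unordered pair of distinct predictors: C(4, 2) = 6 candidates.
pairs = list(combinations(predictors, 2))
for first, second in pairs:
    print(f"{first}:{second}")
```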
##
## Call:
## glm(formula = Is_Malignant ~ Clump_thickness + Marginal_adhesion +
## Uniformity_of_cell_shape + Bland_Chromatin + Clump_thickness:Marginal_adhesion +
## Clump_thickness:Uniformity_of_cell_shape + Clump_thickness:Bland_Chromatin,
## family = "binomial", data = bc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.81582 -0.08713 -0.02247 0.06079 3.00260
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.92552 2.40593 -6.204
## Clump_thickness 1.58569 0.38883 4.078
## Marginal_adhesion 0.94342 0.31700 2.976
## Uniformity_of_cell_shape 1.25862 0.43419 2.899
## Bland_Chromatin 1.08953 0.38048 2.864
## Clump_thickness:Marginal_adhesion -0.10144 0.04981 -2.037
## Clump_thickness:Uniformity_of_cell_shape -0.09581 0.06822 -1.404
## Clump_thickness:Bland_Chromatin -0.07267 0.06364 -1.142
## Pr(>|z|)
## (Intercept) 5.52e-10 ***
## Clump_thickness 4.54e-05 ***
## Marginal_adhesion 0.00292 **
## Uniformity_of_cell_shape 0.00375 **
## Bland_Chromatin 0.00419 **
## Clump_thickness:Marginal_adhesion 0.04168 *
## Clump_thickness:Uniformity_of_cell_shape 0.16020
## Clump_thickness:Bland_Chromatin 0.25353
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 121.97 on 675 degrees of freedom
## AIC: 137.97
##
## Number of Fisher Scoring iterations: 8
## [1] "Accuracy = 99.2481203007519 %"
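To make the summary concrete, the fitted coefficients can be plugged into the logistic function to get a predicted probability for a single sample. The sketch below copies the estimates from the output above. The function name `malignancy_probability` and the two example inputs are hypothetical, and how the probability was turned into a class label for the accuracy figure is not stated here.

```python
from math import exp

# Estimates copied from the coefficient table in the model summary.
COEF = {
    "intercept": -14.92552,
    "clump": 1.58569,
    "adhesion": 0.94342,
    "shape": 1.25862,
    "chromatin": 1.08953,
    "clump:adhesion": -0.10144,
    "clump:shape": -0.09581,
    "clump:chromatin": -0.07267,
}


def malignancy_probability(clump, adhesion, shape, chromatin):
    """Predicted P(malignant) from the main effects and interaction terms."""
    z = (
        COEF["intercept"]
        + COEF["clump"] * clump
        + COEF["adhesion"] * adhesion
        + COEF["shape"] * shape
        + COEF["chromatin"] * chromatin
        + COEF["clump:adhesion"] * clump * adhesion
        + COEF["clump:shape"] * clump * shape
        + COEF["clump:chromatin"] * clump * chromatin
    )
    return 1.0 / (1.0 + exp(-z))


# Hypothetical inputs: every attribute at 1, then every attribute at 10.
p_low = malignancy_probability(1, 1, 1, 1)
p_high = malignancy_probability(10, 10, 10, 10)
print(p_low, p_high)
```

The negative interaction coefficients dampen the combined effect when clump thickness and another attribute are both large, but the predicted probability still rises from near 0 to near 1 across the range.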
Applying PCA would also help us remove multicollinearity, which would let us ignore redundant features.
We can see that, after applying PCA, most of the variance is captured by the first principal component. The second and third components also contribute a little to the variance. So now we can check how the data looks using the 2 and 3 most important dimensions.
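The idea of "variance captured by a component" can be illustrated on a tiny two-feature case, where the eigenvalues of the covariance matrix have a closed form. This is a sketch only: the real analysis involves nine features and would normally use a library routine, and the toy values below are hypothetical.

```python
from math import sqrt


def covariance_2d(xs, ys):
    """Sample variances and covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy


def explained_variance_ratio_2d(xs, ys):
    """Share of total variance along each principal component (2 features)."""
    sxx, sxy, syy = covariance_2d(xs, ys)
    # Eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    centre = (sxx + syy) / 2
    radius = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = (centre + radius, centre - radius)
    total = sxx + syy
    return tuple(value / total for value in eigenvalues)


# Hypothetical, strongly correlated toy features (NOT the real data).
cell_size = [1, 2, 3, 4, 5, 8, 9, 10]
cell_shape = [1, 1, 2, 3, 6, 7, 8, 10]

ratios = explained_variance_ratio_2d(cell_size, cell_shape)
print(ratios)
```

When features are highly correlated, as ours are, the first component absorbs most of the variance, which is the pattern described above.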
Let us draw a decision boundary on the data from the 2 dimensional PCA.
## [1] "Accuracy 0.970717423133236"
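For reference, a linear decision boundary in two dimensions amounts to checking the sign of a weighted sum, and accuracy is the fraction of samples classified correctly. The sketch below uses hypothetical weights and points. It does not reproduce our fitted boundary or the accuracy above.

```python
def predict(point, weights, bias):
    """Label a 2-D point 1 (malignant) if it lies on the positive side."""
    score = weights[0] * point[0] + weights[1] * point[1] + bias
    return 1 if score > 0 else 0


def accuracy(points, labels, weights, bias):
    """Fraction of points whose predicted label matches the true label."""
    correct = sum(
        predict(point, weights, bias) == label
        for point, label in zip(points, labels)
    )
    return correct / len(labels)


# Hypothetical (PC1, PC2) coordinates and labels, for illustration only.
points = [(-2.0, 0.3), (-1.5, -0.2), (-0.2, 0.1), (1.0, 0.5), (2.5, -0.4), (0.1, -0.3)]
labels = [0, 0, 0, 1, 1, 1]

# Hypothetical boundary: the vertical line PC1 = 0.5.
acc = accuracy(points, labels, weights=(1.0, 0.0), bias=-0.5)
print(f"Accuracy {acc}")
```

Here the last point lies on the wrong side of the line, so five of the six toy samples are classified correctly.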
We experiment with Random Forest models of different depths and check which one performs best.
Accuracy (%) for models with depths 1, 2, 3, 4, 5, 6:
## [1] 97.07 96.78 97.36 96.93 96.93 97.36
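The reported accuracies can be compared programmatically. This short sketch uses only the six values printed above; the variable names are our own.

```python
depths = [1, 2, 3, 4, 5, 6]
accuracies = [97.07, 96.78, 97.36, 96.93, 96.93, 97.36]  # values reported above

best = max(accuracies)
best_depths = [d for d, a in zip(depths, accuracies) if a == best]
spread = round(best - min(accuracies), 2)

print(f"Best accuracy {best}% at depths {best_depths}")
print(f"Spread between best and worst depth: {spread} percentage points")
```

The spread across all six depths is well under one percentage point, so a depth of 1 loses very little compared with the best-performing depths.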
Gini Importance, or Mean Decrease in Impurity (MDI), computes each feature's importance as the sum of the impurity decreases over all splits (across all trees) that use the feature, with each split weighted by the proportion of samples it handles. We have plotted the relative feature importance for the model that we created.
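The contribution of a single split to this importance can be worked out by hand. The following is a minimal sketch for binary labels. The function names and the toy node are hypothetical, and a full MDI would sum these terms over every split on a feature and average across trees.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)  # equals 1 - p**2 - (1 - p)**2 for two classes


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease contributed by one split."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


# Hypothetical root node with 8 samples, split into two children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print(decrease)
```

A split that separates the classes perfectly would yield the largest possible decrease for this node, 0.5, so features that produce cleaner splits on more samples accumulate higher importance.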
We tried 3 different models: GLM, GLM with PCA, and Random Forest. Their accuracies are 99.25%, 97.07%, and 97.07% respectively. Random Forest was chosen as the final model because it achieved high accuracy with a very simple configuration: depth=1 and estimators=100.
We have carried out a detailed analysis of which features are useful in predicting the malignancy of breast cancer. We also provided a simple dendrogram-like chart, derived from the Random Forest, that can help anyone predict whether a tumor is benign or malignant from the 9 given features. Our results suggest that mainly 1 or 2 features of the tumor cells are most important for predicting malignancy. It is also very important to identify the cancerous nature of tumor cells in the early stages, as nearly 86% of patients could be cured if the tumor is treated early. As future work, we can test on more practical data to assess the robustness of our model and improve our findings.
We thank Prof. Brad Luen and Seiji Sloan for guiding us throughout the project. The inspiration to try the Random Forest model and to make a simple dendrogram-like chart came from suggestions by Prof. Brad Luen during the presentations, and we sincerely thank him for guiding us throughout the semester.